Reproducible science:

A bioinformatic service perspective

January Weiner

Core Unit for Bioinformatics, BIH@Charité

Three stories

  • Predicting tuberculosis (TB) onset in patients
  • Operating an RNA-Seq pipeline at the CUBI
  • How not to do gene set enrichments

Core principles

  • Reproducibility: the analysis should be reproducible with minimal effort (always a compromise!); this also covers good scientific practice
  • Accountability: we should be able to trace every result back to its underlying data (e.g., a gene set enrichment back to the differential expression of individual genes) with minimal effort; this also covers user empowerment
  • Interpretability: we should be able to interpret the results in the given biological or medical context; this also raises the question: collaboration or service?

Predicting the onset of TB

Key papers:

Weiner 3rd J, Maertzdorf J, Sutherland JS, Duffy FJ, Thompson E, Suliman S, McEwen G, Thiel B, Parida SK, Zyla J, Hanekom WA. Metabolite changes in blood predict the onset of tuberculosis. Nature Communications. 2018 Dec 6;9(1):5208.

Suliman S, Thompson EG, Sutherland J, Weiner 3rd J, Ota MO, Shankar S, Penn-Nicholson A, Thiel B, Erasmus M, Maertzdorf J, Duffy FJ. Four-gene pan-African blood signature predicts progression to tuberculosis. American Journal of Respiratory and Critical Care Medicine. 2018 May 1;197(9):1198-208.

Duffy FJ, Weiner 3rd J, Hansen S, Tabb DL, Suliman S, Thompson E, Maertzdorf J, Shankar S, Tromp G, Parida S, Dover D. Immunometabolic signatures predict risk of progression to active tuberculosis and disease outcome. Frontiers in Immunology. 2019 Mar 22;10:527.

Implementing the core principles:

  • Introduced blinded predictions for the validation set
  • All data published
  • Results, paper, and supplementary materials generated and written entirely in R Markdown, available from GitHub

Running an RNA-Seq pipeline at CUBI

Implementing the core principles:

  • Reproducibility:
    • keeping track of all analyses
    • keeping track of all software versions
    • software environments preserved with conda
    • data and meta-data stored in SODAR “forever”
  • Accountability: ensured by providing custom interactive interfaces (read-only mode; not for running the analyses)
  • Interpretability
    • frequent updates and discussions
    • incorporating research hypotheses in the pipeline
    • collaborations rather than service
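
As an illustration of what "keeping track of all software versions" can look like in practice, here is a minimal Python sketch that writes the interpreter and package versions next to an analysis result. It is not the CUBI pipeline's mechanism (the slides only mention conda and SODAR); the function name `snapshot_environment` and the package names passed to it are hypothetical.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict with the Python version, platform, and package versions.

    Packages that are not installed are recorded as None rather than
    raising, so the snapshot can always be written.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical package list; in a real analysis this would name the
# libraries the analysis actually imports.
snapshot = snapshot_environment(["pip", "definitely-not-installed-package"])
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing such a snapshot alongside each result complements a preserved conda environment: the environment makes a rerun possible, while the snapshot documents what was actually used.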

How not to do gene set enrichments

  • two groups of patients: G1 and G2
  • two conditions: COVID vs controls
  • do the two groups react differently to COVID?

How not to do gene set enrichments

  • 25% of transcriptomics papers with a Venn diagram fall for this fallacy

  • correct interpretation: interactions, disco score (Domaszewska et al. and Weiner, 2017)
  • paper sources available as R Markdown from GitHub
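
The figure slides are not reproduced here, so the following sketch assumes that the fallacy in question is the common one of comparing lists of significant genes between G1 and G2 (the Venn diagram approach) instead of testing the interaction directly. All numbers are hypothetical: one gene, a log fold change (COVID vs controls) and its standard error in each group, and a simple normal approximation for the p-values.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate/se under a standard normal."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical log fold changes (COVID vs controls) for one gene,
# estimated separately in the two independent patient groups.
lfc_g1, se_g1 = 1.0, 0.45
lfc_g2, se_g2 = 0.8, 0.45

# Venn diagram approach: test each group separately, compare the verdicts.
p_g1 = two_sided_p(lfc_g1, se_g1)
p_g2 = two_sided_p(lfc_g2, se_g2)
looks_g1_specific = p_g1 < ALPHA and p_g2 >= ALPHA

# Interaction approach: test the difference between the two effects.
interaction = lfc_g1 - lfc_g2
se_interaction = sqrt(se_g1**2 + se_g2**2)  # groups are independent
p_interaction = two_sided_p(interaction, se_interaction)

print(f"G1 alone:    p = {p_g1:.3f}")
print(f"G2 alone:    p = {p_g2:.3f}")
print(f"looks G1-specific in a Venn diagram: {looks_g1_specific}")
print(f"interaction: p = {p_interaction:.3f}")
```

With these numbers the gene is significant in G1 only, so it would land in the "G1-specific" part of a Venn diagram, yet the two effects are nearly identical and the interaction test gives no evidence that the groups react differently. A real analysis would fit the interaction term in the differential expression model rather than use this normal approximation.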

Thank you

  • Personal github: https://github.com/january3/
  • CUBI github: https://github.com/bihealth/
  • Scholar profile: http://tiny.cc/y8o3vz